Homework 1
Load Packages
library(Hmisc)
library(tidyverse)
Problem 1
Survey
August 29, 2024 at 9:44pm.
Campuswire
Insert the image you uploaded to Campuswire here.
Problem 2
Question 1
The study population of data set 1 is people in the UK who have experienced crime. The study population of data set 2 is crimes that were reported.
Question 2
In data set 1, the answers were voluntary and self-reported. Data set 2 is a convenience sample because the records were reported by the UK police.
Question 3
The first data set is a self-reported survey of people aged 16 or older who do not live in communal residences. The second data set consists of crimes recorded by the UK police.
Question 4
The target population for data set 1 concerns how much crime occurs in the UK.
Question 5
Data set 1 is neither very valid nor very reliable, because the answers are self-reported and could be biased. Data set 2 is reliable and valid because the crimes were reported and administered by the UK police, which makes them a valid and reliable source. The conclusions from data set 1 will not be generalizable because self-reports may differ from person to person. The conclusions from data set 2 can be generalized and can show whether crime rates have gone up or down, but only if the police definition of a crime has not changed over time.
Problem 3
Question 1
The <- notation is R's assignment operator. For ordinary top-level assignment it behaves like an = sign, and it is the conventional way to create variables. After running this code chunk, the data frame named df appears in the Environment pane on the right-hand side of RStudio.
df <- read_csv('https://www.openintro.org/data/csv/babies.csv')
Rows: 1236 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (8): case, bwt, gestation, parity, age, height, weight, smoke
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
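The difference between <- and = can be illustrated with a minimal base R sketch. The variable names x, y, and m are made up for this example and are not part of the homework.

```r
# Both operators bind a value to a name at the top level.
x <- 5
y = 5
identical(x, y)

# Inside a function call, `=` names an argument instead of assigning.
# Here `x` refers to mean()'s argument, so the global x is left unchanged.
m <- mean(x = c(1, 2, 3))
x
```

Because = has this second meaning inside function calls, many R style guides reserve <- for assignment.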
Question 2
The notation Hmisc:: calls the function directly from the Hmisc package. describe() is a common function name, so this prefix is sometimes needed to tell R which package's version you want to use. The pipe operator |> passes the result of the first line into the function on the second line as its first argument, which is a convenient way to chain functions together.
This code prints a useful and attractive summary of the data set we are using.
Hmisc::describe(df) |>
  html()
8 Variables 1236 Observations
case
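| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1236 | 0 | 1236 | 1 | 618.5 | 412.3 | 62.75 | 124.50 | 309.75 | 618.50 | 927.25 | 1112.50 | 1174.25 |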
lowest : 1 2 3 4 5 , highest: 1232 1233 1234 1235 1236
bwt
| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1236 | 0 | 107 | 1 | 119.6 | 20.33 | 88.0 | 97.0 | 108.8 | 120.0 | 131.0 | 142.0 | 149.0 |
gestation
| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1223 | 13 | 106 | 0.999 | 279.3 | 16.57 | 252.0 | 262.0 | 272.0 | 280.0 | 288.0 | 295.8 | 302.0 |
parity
| n | missing | distinct | Info | Sum | Mean | Gmd |
|---|---|---|---|---|---|---|
| 1236 | 0 | 2 | 0.57 | 315 | 0.2549 | 0.3801 |
age
| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1234 | 2 | 30 | 0.997 | 27.26 | 6.506 | 19 | 20 | 23 | 26 | 31 | 36 | 38 |
height
| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1214 | 22 | 19 | 0.986 | 64.05 | 2.839 | 60 | 61 | 62 | 64 | 66 | 67 | 68 |
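| Value | 53 | 54 | 56 | 57 | 58 | 59 | 60 | 61 | 62 | 63 | 64 | 65 | 66 | 67 | 68 | 69 | 70 | 71 | 72 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Frequency | 1 | 1 | 1 | 1 | 10 | 26 | 55 | 105 | 131 | 166 | 183 | 182 | 153 | 105 | 54 | 20 | 13 | 6 | 1 |
| Proportion | 0.001 | 0.001 | 0.001 | 0.001 | 0.008 | 0.021 | 0.045 | 0.086 | 0.108 | 0.137 | 0.151 | 0.150 | 0.126 | 0.086 | 0.044 | 0.016 | 0.011 | 0.005 | 0.001 |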
For the frequency table, variable is rounded to the nearest 0
weight
| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1200 | 36 | 105 | 0.999 | 128.6 | 22.39 | 102.0 | 105.0 | 114.8 | 125.0 | 139.0 | 155.0 | 170.0 |
smoke
| n | missing | distinct | Info | Sum | Mean | Gmd |
|---|---|---|---|---|---|---|
| 1226 | 10 | 2 | 0.717 | 484 | 0.3948 | 0.4782 |
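The :: and |> notation used in this question can be tried out with base R alone. The sketch below assumes R 4.1 or later (needed for the native pipe) and uses a made-up vector v, not the babies data.

```r
# pkg::fun() calls fun from pkg without attaching the package first.
v <- c(2, 4, 6, NA)
stats::median(v, na.rm = TRUE)

# The native pipe passes the left-hand result as the first argument
# of the call on the right, so these two expressions are equivalent.
piped  <- v |> mean(na.rm = TRUE) |> round(1)
nested <- round(mean(v, na.rm = TRUE), 1)
identical(piped, nested)
```

The piped version reads top to bottom in the order the steps run, which is why it is often preferred for longer chains.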
Question 3
The Child Health and Development Studies investigate a range of topics. One study, in particular, considered all pregnancies between 1960 and 1967 among women in the Kaiser Foundation Health Plan in the San Francisco East Bay area. The variables in this data set are as follows.
| Variable Name | Variable Description | Variable Type |
|---|---|---|
| case | id number | num discrete |
| bwt | birthweight, in ounces | num continuous |
| gestation | length of gestation, in days | num continuous |
| parity | binary indicator for a first pregnancy (0 = first pregnancy) | categorical binary |
| age | mother’s age in years | num continuous |
| height | mother’s height in inches | num ordinal |
| weight | mother’s weight in pounds | num continuous |
| smoke | binary indicator for whether the mother smokes | categorical binary |
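Since read_csv() stores every column as a double, the two binary indicators could be converted to factors so that R treats them as categories. The sketch below uses a small made-up data frame named toy that only mimics the column names. The values are hypothetical, and the coding 1 = smoker is an assumption, because the description above gives the coding only for parity.

```r
# Hypothetical rows for illustration; not taken from the babies data.
toy <- data.frame(
  parity = c(0, 1, 0, 0),
  smoke  = c(0, 1, NA, 1),
  bwt    = c(120, 113, 128, 108)
)

# Store the 0/1 indicators as labelled factors.
toy$parity <- factor(toy$parity, levels = c(0, 1),
                     labels = c("first pregnancy", "not first pregnancy"))
# Assumption: 1 means the mother smokes.
toy$smoke <- factor(toy$smoke, levels = c(0, 1),
                    labels = c("non-smoker", "smoker"))

str(toy)                            # parity and smoke are now factors
table(toy$smoke, useNA = "ifany")   # counts per category, including NA
```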
Question 4
Below, 2 numeric variables were investigated for potential relationships. The independent, explanatory variable I chose is gestation, and the dependent, response variable I chose is bwt.
df |>
  ggplot(aes(x = gestation,  # explanatory variable: length of gestation, in days
             y = bwt)) +     # response variable: birthweight, in ounces
  geom_point()
Warning: Removed 13 rows containing missing values or values outside the scale range
(`geom_point()`).
Describe what you see in your plot here.
Most gestation lengths fall between about 250 and 300 days. As gestation gets longer, birth weight tends to go up as well, which suggests a positive relationship between the two variables. Most observations have similar gestation and birth weight values, so the points overlap in one big black cluster on the graph.
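One way to put a number on the relationship described above is a correlation coefficient. This is only a sketch: the data frame toy holds made-up values for illustration, and its NA stands in for the kind of missing gestation value that geom_point() removes with a warning.

```r
# Made-up values for illustration only; not taken from the babies data.
toy <- data.frame(
  gestation = c(270, 284, NA, 282, 255, 290),
  bwt       = c(110, 120, 118, 125, 95, 130)
)

# Count missing values per column before computing anything.
colSums(is.na(toy))

# Pearson correlation, using only rows where both variables are present.
r <- cor(toy$gestation, toy$bwt, use = "complete.obs")
r > 0   # a positive r means birth weight tends to rise with gestation
```

The same call should work on the real data as cor(df$gestation, df$bwt, use = "complete.obs"), assuming df has been loaded as in Question 1.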
Session Info
This portion of the document describes the conditions in RStudio under which this report was created. This is important to include so that work is reproducible by others.
xfun::session_info()
R version 4.4.1 (2024-06-14)
Platform: aarch64-apple-darwin20
Running under: macOS Sonoma 14.5
Locale: en_US.UTF-8 / en_US.UTF-8 / en_US.UTF-8 / C / en_US.UTF-8 / en_US.UTF-8
Package version:
askpass_1.2.0 backports_1.5.0 base64enc_0.1-3
bit_4.0.5 bit64_4.0.5 blob_1.2.4
broom_1.0.6 bslib_0.8.0 cachem_1.1.0
callr_3.7.6 cellranger_1.1.0 checkmate_2.3.2
cli_3.6.3 clipr_0.8.0 cluster_2.1.6
colorspace_2.1-1 compiler_4.4.1 conflicted_1.2.0
cpp11_0.4.7 crayon_1.5.3 curl_5.2.1
data.table_1.15.4 DBI_1.2.3 dbplyr_2.5.0
digest_0.6.37 dplyr_1.1.4 dtplyr_1.3.1
evaluate_0.24.0 fansi_1.0.6 farver_2.1.2
fastmap_1.2.0 fontawesome_0.5.2 forcats_1.0.0
foreign_0.8-86 Formula_1.2-5 fs_1.6.4
gargle_1.5.2 generics_0.1.3 ggplot2_3.5.1
glue_1.7.0 googledrive_2.1.1 googlesheets4_1.1.1
graphics_4.4.1 grDevices_4.4.1 grid_4.4.1
gridExtra_2.3 gtable_0.3.5 haven_2.5.4
highr_0.11 Hmisc_5.1-3 hms_1.1.3
htmlTable_2.4.3 htmltools_0.5.8.1 htmlwidgets_1.6.4
httr_1.4.7 ids_1.0.1 isoband_0.2.7
jquerylib_0.1.4 jsonlite_1.8.8 knitr_1.48
labeling_0.4.3 lattice_0.22.6 lifecycle_1.0.4
lubridate_1.9.3 magrittr_2.0.3 MASS_7.3.60.2
Matrix_1.7.0 memoise_2.0.1 methods_4.4.1
mgcv_1.9.1 mime_0.12 modelr_0.1.11
munsell_0.5.1 nlme_3.1.164 nnet_7.3-19
openssl_2.2.1 parallel_4.4.1 pillar_1.9.0
pkgconfig_2.0.3 prettyunits_1.2.0 processx_3.8.4
progress_1.2.3 ps_1.7.7 purrr_1.0.2
R6_2.5.1 ragg_1.3.2 rappdirs_0.3.3
RColorBrewer_1.1.3 readr_2.1.5 readxl_1.4.3
rematch_2.0.0 rematch2_2.1.2 reprex_2.1.1
rlang_1.1.4 rmarkdown_2.28 rpart_4.1.23
rstudioapi_0.16.0 rvest_1.0.4 sass_0.4.9
scales_1.3.0 selectr_0.4.2 splines_4.4.1
stats_4.4.1 stringi_1.8.4 stringr_1.5.1
sys_3.4.2 systemfonts_1.1.0 textshaping_0.4.0
tibble_3.2.1 tidyr_1.3.1 tidyselect_1.2.1
tidyverse_2.0.0 timechange_0.3.0 tinytex_0.52
tools_4.4.1 tzdb_0.4.0 utf8_1.2.4
utils_4.4.1 uuid_1.2.1 vctrs_0.6.5
viridis_0.6.5 viridisLite_0.4.2 vroom_1.6.5
withr_3.0.1 xfun_0.47 xml2_1.3.6
yaml_2.3.10
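If the xfun package is not available, similar information can be collected with base R. This is a sketch of an alternative, not the method used to produce the listing above, and the name info is made up for the example.

```r
# sessionInfo() ships with base R (utils package) and needs no extra installs.
info <- sessionInfo()
info$R.version$version.string   # the R version string
info$running                    # the operating system description

# Look up the version of one specific installed package.
utils::packageVersion("stats")
```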